286 research outputs found

    A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures

    Full text link
    Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically with less thread synchronization and reduced number of conflicts compared to Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.Comment: To appear in the proceedings of Euro Par 201

    Fast and high quality topology-aware task mapping

    Get PDF
    Considering the large number of processors and the size of the interconnection networks on exascale capable supercomputers, mapping concurrently executable and communicating tasks of an application is a complex problem that needs to be dealt with care. For parallel applications, the communication overhead can be a significant bottleneck on scalability. Topology-aware task-mapping methods that map the tasks to the processors (i.e., cores) by exploiting the underlying network information are very effective to avoid, or at worst bend, this limitation. We propose novel, efficient, and effective task mapping algorithms employing a graph model. The experiments show that the methods are faster than the existing approaches proposed for the same task, and on 4096 processors, the algorithms improve the communication hops and link contentions by 16% and 32%, respectively, on the average. In addition, they improve the average execution time of a parallel SpMV kernel and a communication-only application by 9% and 14%, respectively

    Cooperative Minibatching in Graph Neural Networks

    Full text link
    Significant computational resources are required to train Graph Neural Networks (GNNs) at a large scale, and the process is highly data-intensive. One of the most effective ways to reduce resource requirements is minibatch training coupled with graph sampling. GNNs have the unique property that items in a minibatch have overlapping data. However, the commonly implemented Independent Minibatching approach assigns each Processing Element (PE) its own minibatch to process, leading to duplicated computations and input data access across PEs. This amplifies the Neighborhood Explosion Phenomenon (NEP), which is the main bottleneck limiting scaling. To reduce the effects of NEP in the multi-PE setting, we propose a new approach called Cooperative Minibatching. Our approach capitalizes on the fact that the size of the sampled subgraph is a concave function of the batch size, leading to significant reductions in the amount of work per seed vertex as batch sizes increase. Hence, it is favorable for processors equipped with a fast interconnect to work on a large minibatch together as a single larger processor, instead of working on separate smaller minibatches, even though global batch size is identical. We also show how to take advantage of the same phenomenon in serial execution by generating dependent consecutive minibatches. Our experimental evaluations show up to 4x bandwidth savings for fetching vertex embeddings, by simply increasing this dependency without harming model convergence. Combining our proposed approaches, we achieve up to 64% speedup over Independent Minibatching on single-node multi-GPU systems.Comment: Under submissio

    A constructive multi-way circuit partitioning algorithm based on minimum degree ordering

    Get PDF
    Ankara : The Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent Univ., 1994.Thesis (Master's) -- Bilkent University, 1994.Includes bibliographical references leaves 52-54.Circuit partitioning has many important applications in VLSI. Circuit partitioning problem can be most properly modeled as hypergraph partitioning. In this work, we propose a novel k-v/ay hypergraph partitioning heuristic using the Minimum Degree (MD) ordering which is a well-known heuristic for reducing the amount of fills in the factorization of symmetric sparse matrices. The proposed algorithm operates on the dual graph of the given hypergraph. The algorithm grows node-clusters on the dual graph which induce cell-clusters with locally minimum net-cut sizes. The quotient graph concept, widely used in MD ordering, is exploited for the sake of efficient implementation. The proposed algorithm outperforms well-known heuristics, such as Kernighan-Lin (KL) based algorithms and Simulated Annealing, in terms of solution quality on various VLSI benchmark circuits. A nice property of the proposed algorithm is that its execution time reduces with increasing k as opposed to the existing iterative heuristics. It is even faster than the fast KL-based algorithms on the partitioning of the benchmark circuits for k > 16.Çatalyürek, Ümit VM.S
    corecore